library(phonto)
library(DT)

NHANES

The National Health and Nutrition Examination Survey (NHANES) datasets are collected by the Centers for Disease Control and Prevention (CDC) in the USA and include demographic, dietary, laboratory, examination, and questionnaire data.

The survey is carried out in two-year “epochs” from 1999-2000 to 2017-2018, and within each epoch a set of people are surveyed. However, not all participants are surveyed for all of the components (e.g. demographics, dietary, laboratory, examination, and questionnaire) of the survey.

Each participant is assigned a unique ID, which is stored in the database as SEQN and used as a primary key. Any merging of data extracted from different tables should be based on SEQN. Within each of the two-year epochs NHANES has produced a set of tables, or data files. One example is the Body Measures table, which provides data from the 2001-2002 survey. The web page provides details on the measurements and how they are recorded; two examples of recorded variables are SEQN and weight.
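As a minimal illustration of what merging on SEQN means, the sketch below uses toy stand-in data frames (not the real NHANES tables) and base R's merge():

```r
# Toy stand-ins for two NHANES component tables, keyed by SEQN
demo <- data.frame(SEQN = c(1, 2, 3), RIDAGEYR = c(34, 51, 28))
bmx  <- data.frame(SEQN = c(2, 3, 4), BMXWT   = c(81.2, 64.5, 70.0))

# Inner join on the primary key; only SEQNs present in both tables survive
merged <- merge(demo, bmx, by = "SEQN")
merged
#>   SEQN RIDAGEYR BMXWT
#> 1    2       51  81.2
#> 2    3       28  64.5
```

Participant 1 has no body measures and participant 4 has no demographics, so neither appears in the merged result.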

RG: But, importantly, there are no people that span two (or more) epochs, so any joins have to be within a two-year epoch.

Laha: Actually, is it possible that people may be assigned new SEQNs in different “epochs”? Do we know about that? Yes, you can read up on their documentation, but people move to different parts of the country and the sampling fraction is very low, so the probability of the same person being surveyed twice is very small. ***

RG: Within each survey epoch the data are organized in named tables (you can outline the notion of the SAS name, the short variable name, the longer name and so on).

Laha: We do not have the SAS names in the database for now. I can copy them from the website, but I think we want a function to retrieve that information from the database. We should have them soon.

The CDC provides substantial web-based tools for investigating the data and metadata, and Endres et al. have produced an R package called nhanesA. Most of the data collected can easily be downloaded from the CDC website; however, accessing the data in that way can be problematic and error prone. The nhanesA package provides a set of tools to search and download the data and metadata, which makes the data more accessible to users. However, the package needs to access the NHANES website every time the user calls one of its functions, which makes data retrieval slow and occasionally raises errors due to network issues.

In the spirit of producing more easily reproduced and shared analyses we have created a SQL database in a Docker (cite Docker) container that contains most of the available data. Around that SQL database we have constructed a number of R packages and other tools that help scientists analyze the data. Since all of the code and data are under version control, anyone can obtain the same version of the data and code and get identical results.

In this notebook, we demonstrate data manipulation with our R functions to show how they can help in this kind of work.

We can load the data directly with the nhanes() function if we know which data file we want. Otherwise, we can retrieve the data in two steps:

- search the tables/data files through variable names or table names
- load the data with unionQuery() or jointQuery()

0. We might want to outline some search strategies.

How do I find out about variables that encode height information? Can you run through a set of commands that would get someone information about which tables they are in, and then use that to get a data frame with height in it?

Laha: Agreed, but currently we can only search with variable names like BMXHT, which is not user-friendly. I will revise the function to support names like “Height”, “height”, “HEIGHT” once we have the metadata in the database.

1. searchTableByName()

This function lets users search for tables in the database by the table name, which can be either the short names used by NHANES or the longer descriptive names.

FIXME: is this correct, Laha? If so, can you add something to the man page to show this and an example of using the long descriptive names. Also, the man page says: includerdc: If TRUE then RDC-only tables are included (default=FALSE), but what is an RDC table? Can you explain on the man page? searchTableByName() searches the tables by name patterns and returns a data frame of results, which contains the questionnaire (the table names provided on the NHANES websites), a longer descriptive table name (e.g. BloodPressure), and the survey years.

No, it does not support searching by the long table name; it was designed to search short table names as the nhanesA package does. But I can add that if it is what we want.

For example, we can search for blood pressure related tables with the following code. We choose “BPX” as the parameter based on the CDC table naming conventions.

res = searchTableByName("BPX")
datatable(res)

In the resulting data frame, the Questionnaire column gives the CDC table names, and the TableName column gives the long descriptive names, which hold the data across the years.

The above search result includes the table BloodPressureOscillometricMeasurements, a different kind of blood pressure measurement, which may not be what we want. We can then replace the search pattern “BPX” with “BPX[_]”, which matches only table names containing the string “BPX_”; therefore “BPXO_J” will be excluded, though note that the BPX table with no underscore is excluded as well. FIXME: this next sentence is encouraging you - Laha: You must put yourself into the frame of mind of the naive user and explain things in great detail here.
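To see why the bracketed underscore matters, base R's grepl() shows which names each pattern matches (the table names here are just illustrative strings):

```r
tabs <- c("BPX", "BPX_B", "BPX_J", "BPXO_J")

# "BPX" matches every name containing the substring BPX, including BPXO_J
grepl("BPX", tabs)
#> [1] TRUE TRUE TRUE TRUE

# "BPX[_]" requires a literal underscore right after BPX, so BPXO_J is
# excluded -- but so is plain "BPX", which has no underscore at all
grepl("BPX[_]", tabs)
#> [1] FALSE  TRUE  TRUE FALSE
```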

res = searchTableByName("BPX[_]")
datatable(res)

2. unionQuery()

The unionQuery() function is designed to aggregate data across the years. It combines the searched results and returns them as a data frame. For example, the blood pressure tables in CDC are BPX, BPX_B, …, BPX_J for years 1999-2000 to 2017-2018. Users can aggregate some or all of the data.

##FIXME - here we disagree - I think this function is designed to aggregate data, within a table, across years - you might need to explain how the BPX type tables differ across years - but if we want blood pressure for everyone over the whole range of the study we need to somehow query across the set of tables.
##FIXME: we do that for different (maybe they are conceptual tables) - and maybe we need to explain that more up at the top. Again you could start by saying that they measured blood pressure every epoch, that’s how you get the tables BPX,…BPX_J. Maybe you show that they didn’t measure all the same things every survey. Maybe you explain that BPXO_J is another way of measuring blood pressure.

tablenames = searchTableByName('BPX[_]')
blood_df <- unionQuery(tablenames$TableName, cols = c("BPXDI1","BPXDI2","BPXSY1","BPXSY2"))
DT::datatable(blood_df)
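Conceptually, the union across epochs behaves like row-stacking the per-epoch tables. A toy sketch in base R (with made-up values, not the package's actual implementation):

```r
# Toy per-epoch blood pressure tables; the real epochs are BPX, BPX_B, ..., BPX_J
bpx_b <- data.frame(SEQN = c(10, 11), BPXSY1 = c(118, 131))
bpx_c <- data.frame(SEQN = c(20, 21), BPXSY1 = c(124, 109))

# Union across epochs: stack the rows; SEQNs never repeat across epochs,
# so no de-duplication is needed
pooled <- rbind(bpx_b, bpx_c)
nrow(pooled)
#> [1] 4
```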

3. jointQuery()

The jointQuery() function takes a list of table names and a set of column names; it merges the results and returns them as a data frame. The data are joined by SEQN because it is a unique ID within an epoch. ##FIXME: ok this one is more interesting - but you need to perhaps explain that SEQN is the primary key in this database and that jointQuery is going to merge on that quantity. And this might be where you want to be clear that SEQNs are unique to epochs - they don’t appear in more than one, and even if it was the same person there is no way to know that.

The function supports the long descriptive names, for example:

tables <- c("DemographicVariablesAndSampleWeights","BloodPressure") 
cols <- c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2","years","BPXDI1","BPXDI2","BPXSY1","BPXSY2")
data <- jointQuery(tables,cols)
datatable(data)
#> Warning in instance$preRenderHook(instance): It seems your data is too big
#> for client-side DataTables. You may consider server-side processing: https://
#> rstudio.github.io/DT/server.html

We can also use the CDC table names if we know them:


tables <- c("DEMO_C","BPX_C") 
cols <- c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2","years","BPXDI1","BPXDI2","BPXSY1","BPXSY2")
data <- jointQuery(tables,cols)
datatable(data)

4. PHESANT-like

We can run the PHESANT-like process to convert each column into an appropriate data type. It also reports the ratio of unique values (r_unique), the proportion of zeros (r_zeros), and the ratio of NAs (r_NAs), each calculated as the number of unique values, zeros, or NAs divided by the total number of records. Categorical data types (ordered or unordered) are represented by integers, and the PHESANT function categorizes them as multilevel. For example, education (DMDEDUC2) being categorized as Multilevel-7 means the PHESANT process considers it multilevel with 7 levels.
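The three ratios can be illustrated on a toy column. This is a sketch of the computation as described above, not the package's internal code; in particular, counting only non-NA values toward r_unique is an assumption here:

```r
x <- c(0, 0, 1, 2, 2, NA, NA, NA, 3, 3)
n <- length(x)

r_unique <- length(unique(x[!is.na(x)])) / n  # distinct non-NA values / records
r_zeros  <- sum(x == 0, na.rm = TRUE) / n     # zeros / records
r_NAs    <- sum(is.na(x)) / n                 # missing values / records

c(r_unique = r_unique, r_zeros = r_zeros, r_NAs = r_NAs)
#> r_unique  r_zeros    r_NAs 
#>      0.4      0.2      0.3
```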

phs_dat = phesant(data)
data = phs_dat$data
DT::datatable(phs_dat$phs_res)

5. Set up factor levels for categorical variables

Categorical variables are represented with integers, as shown below.

data[,c('RIAGENDR', 'RIDRETH1','DMDEDUC2')] |> head() |> knitr::kable()
| RIAGENDR | RIDRETH1 | DMDEDUC2 |
|---------:|---------:|---------:|
|        2 |        4 |       NA |
|        1 |        1 |       NA |
|        1 |        1 |       NA |
|        2 |        3 |       NA |
|        2 |        1 |        3 |
|        2 |        3 |       NA |

The real factor levels can be set with the nhanesTranslate() function from the nhanesA package, as shown below.

# data = nhanesA::nhanesTranslate('DEMO_D', c('RIAGENDR', 'RIDRETH1','DMDEDUC2'), data=data[1:1000,])
# # head(data)
# datatable(data[,c('SEQN','RIAGENDR', 'RIDRETH1','DMDEDUC2')])
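If the codebook coding is known, the same recoding can also be done with base R's factor(); the sketch below uses toy values, and the 1 = Male, 2 = Female coding for RIAGENDR comes from the NHANES documentation:

```r
# RIAGENDR codebook: 1 = Male, 2 = Female
riagendr <- c(2, 1, 1, 2, 2, 2)
gender <- factor(riagendr, levels = c(1, 2), labels = c("Male", "Female"))
table(gender)
#> gender
#>   Male Female 
#>      2      4
```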

##FIXME: Laha - somewhere you should have an explicit definition of what is in the code - what makes something continuous - what makes it discrete, what makes it ordered? In the first pass, all the categorical values are discrete integers - that does not make them ordered - you have to look at the actual levels - and we need to explain how our users will do that.

##FIXME - the more I think about it, we should only report a factor if the column has been transformed into an R factor, and then you can get from that whether it is ordered or unordered. When you get a column with integer values only (and fewer levels than whatever the PHESANT cut-off is), then call them “categorical” - let’s reserve the label “factor” for R factors.

Currently, we are following the flow chart below, but both ordered and unordered variables are considered multilevel.

PHESANT flow chart